Exploratory data analysis

MACS 30000 University of Chicago

November 20, 2017

Exploratory data analysis

  1. Generate questions about your data
  2. Search for answers by visualising, transforming, and modeling your data
  3. Use what you learn to refine your questions and/or generate new questions
  4. Rinse and repeat until you publish a paper

Exploratory data analysis

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?
  3. Are there outliers in the data?
  4. Do I have missingness? Are there patterns to it?
  5. How much variation/error exists in my statistical estimates? Is there a pattern to it?

Differences between EDA and modeling

Tips dataset

Variable  Explanation
obs       Observation number
totbill   Total bill (cost of the meal), including tax, in US dollars
tip       Tip (gratuity) in US dollars
sex       Sex of person paying for the meal (0=male, 1=female)
smoker    Smoker in party? (0=No, 1=Yes)
day       3=Thur, 4=Fri, 5=Sat, 6=Sun
time      0=Day, 1=Night
size      Size of the party

Tips regression

##          term estimate std.error statistic  p.value
## 1 (Intercept)  0.20656   0.02492    8.2892 8.65e-15
## 2        sexM -0.00854   0.00835   -1.0234 3.07e-01
## 3   smokerYes  0.00364   0.00850    0.4280 6.69e-01
## 4      daySat -0.00177   0.01834   -0.0967 9.23e-01
## 5      daySun  0.01667   0.01902    0.8764 3.82e-01
## 6      dayThu -0.01818   0.02319   -0.7837 4.34e-01
## 7   timeNight -0.02337   0.02612   -0.8948 3.72e-01
## 8        size -0.00962   0.00422   -2.2824 2.34e-02
##   r.squared adj.r.squared  sigma statistic p.value df logLik  AIC  BIC
## 1     0.042        0.0136 0.0607      1.48   0.175  8    342 -665 -634
##   deviance df.residual
## 1    0.868         236

Exploring tips

EDA vs. CDA

  • Exploratory data analysis
  • Confirmatory data analysis

Measures of central tendency

  • Median
  • Mode
  • Arithmetic mean

    \[\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i\]
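
A minimal sketch of the three measures above, using only the Python standard library. The `modes` helper and the `sizes` sample are hypothetical illustrations, not taken from the tips dataset.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return every value that attains the highest count (a sample can have several modes)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical party sizes, made up for illustration
sizes = [2, 2, 3, 4, 2, 6, 3]

print("mean:  ", mean(sizes))    # arithmetic mean: sum of x_i divided by n
print("median:", median(sizes))  # middle value of the sorted sample
print("modes: ", modes(sizes))   # most frequent value(s)
```

The single large value (6) pulls the mean above the median, which is one quick sign of right skew.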

Measures of dispersion

  • Variance

    \[E[X] = \mu\]

    \[\text{Var}(X) \equiv \sigma^2 = E[X^2] - (E[X])^2\]

  • Deviation
  • Standard deviation

    \[\sigma = \sqrt{E[X^2] - (E[X])^2}\]

  • Median absolute deviation

    \[MAD = \text{median}(|X_i - \text{median}(X)|)\]
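
The dispersion formulas above can be sketched as follows. This assumes the population form of the variance (dividing by n), which is what \(E[X^2] - (E[X])^2\) gives when applied to a sample; the function names and the `data` values are hypothetical.

```python
from statistics import mean, median


def variance(xs):
    """Population variance computed as E[X^2] - (E[X])^2."""
    m = mean(xs)
    return mean([x * x for x in xs]) - m * m


def std_dev(xs):
    """Standard deviation: square root of the variance."""
    return variance(xs) ** 0.5


def mad(xs):
    """Median absolute deviation: median(|X_i - median(X)|)."""
    med = median(xs)
    return median([abs(x - med) for x in xs])


# Hypothetical sample, made up for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("variance:", variance(data))
print("std dev: ", std_dev(data))
print("MAD:     ", mad(data))
```

Because the MAD is built from medians, a single extreme value moves it far less than it moves the standard deviation.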

Histograms

Density estimation

  • Nonparametric density estimation

    \[x_0 + 2(j - 1)h \leq X_i < x_0 + 2jh\]

    \[\hat{p}(x) = \frac{\#_{i = 1}^n [x_0 + 2(j - 1)h \leq X_i < x_0 + 2jh]}{2nh}\]

    \[\hat{p}(x) = \frac{1}{nh} \sum_{i = 1}^n W \left( \frac{x - X_i}{h} \right)\]

    \[W(z) = \begin{cases} \frac{1}{2} & \text{for } |z| < 1 \\ 0 & \text{otherwise} \\ \end{cases}\]

    \[z = \frac{x - X_i}{h}\]
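
A minimal sketch of the weighted-sum form of the naive estimator above, with \(W(z) = 1/2\) for \(|z| < 1\). The function names and the sample values are hypothetical.

```python
def rectangular_weight(z):
    """W(z) = 1/2 for |z| < 1, and 0 otherwise."""
    return 0.5 if abs(z) < 1 else 0.0


def naive_density(x, sample, h):
    """p_hat(x) = 1/(n*h) * sum of W((x - X_i) / h) over the sample."""
    n = len(sample)
    return sum(rectangular_weight((x - xi) / h) for xi in sample) / (n * h)


# Hypothetical observations, made up for illustration
sample = [1.0, 1.5, 2.0, 4.0]
h = 1.0

for x in (1.5, 4.0, 10.0):
    print(x, naive_density(x, sample, h))
```

Each observation contributes a box of width \(2h\) centered on itself, so the estimate at \(x\) is the share of observations within \(h\) of \(x\), divided by \(2h\).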

Naive density estimation

Density estimation

  • Kernels

    \[\hat{p}(x) = \frac{1}{nh} \sum_{i = 1}^n K \left( \frac{x - X_i}{h} \right)\]

Gaussian kernel

\[K(z) = \frac{1}{\sqrt{2 \pi}}e^{-\frac{1}{2} z^2}\]

Rectangular (uniform) kernel

\[K(z) = \frac{1}{2} \mathbf{1}_{\{ |z| \leq 1 \} }\]

Triangular kernel

\[K(z) = (1 - |z|) \mathbf{1}_{\{ |z| \leq 1 \} }\]

Quartic (biweight) kernel

\[K(z) = \frac{15}{16} (1 - z^2)^2 \mathbf{1}_{\{ |z| \leq 1 \} }\]

Epanechnikov kernel

\[K(z) = \frac{3}{4} (1 - z^2) \mathbf{1}_{\{ |z| \leq 1 \} }\]
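
The five kernels above can be written directly as functions and plugged into the kernel estimator, which averages \(K((x - X_i)/h)\) over all \(n\) observations and divides by \(h\). This is a sketch under stated assumptions: the `kde` and `integrate` helpers are hypothetical names, and the midpoint-rule check is only a rough numerical confirmation that each kernel integrates to 1.

```python
import math


def gaussian(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def rectangular(z):
    return 0.5 if abs(z) <= 1 else 0.0


def triangular(z):
    return 1 - abs(z) if abs(z) <= 1 else 0.0


def quartic(z):
    return 15 / 16 * (1 - z * z) ** 2 if abs(z) <= 1 else 0.0


def epanechnikov(z):
    return 0.75 * (1 - z * z) if abs(z) <= 1 else 0.0


def kde(x, sample, h, kernel=gaussian):
    """Kernel density estimate at x: 1/(n*h) * sum of K((x - X_i) / h)."""
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)


def integrate(f, lo=-6.0, hi=6.0, steps=12000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


# Each kernel should integrate to (approximately) 1
for k in (gaussian, rectangular, triangular, quartic, epanechnikov):
    print(k.__name__, round(integrate(k), 4))
```

Swapping the `kernel` argument changes the shape of each bump but not the structure of the estimator; the bandwidth `h` usually matters more than the kernel choice.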

Comparison of kernels

Selecting the bandwidth \(h\)

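\[h = 0.9 \sigma n^{-1 / 5}\]

  • If the data may not be normal, replace \(\sigma\) with the adaptive spread \(A\), where \(S\) is the sample standard deviation

\[A = \min \left( S, \frac{IQR}{1.349} \right)\]

\[h = 0.9 A n^{-1 / 5}\]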
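
A minimal sketch of this rule of thumb, assuming the robust spread \(A\) stands in for \(\sigma\) so that \(h = 0.9 A n^{-1/5}\). The function name and the sample are hypothetical, and the quartiles use the default method of Python's `statistics.quantiles`, so the IQR can differ slightly from other software.

```python
from statistics import quantiles, stdev


def rule_of_thumb_bandwidth(sample):
    """h = 0.9 * A * n^(-1/5), with A = min(S, IQR / 1.349)."""
    n = len(sample)
    s = stdev(sample)                   # sample standard deviation S
    q1, _, q3 = quantiles(sample, n=4)  # quartiles (default "exclusive" method)
    a = min(s, (q3 - q1) / 1.349)       # adaptive spread A
    return 0.9 * a * n ** (-1 / 5)


# Hypothetical sample with one extreme value, made up for illustration
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print("S:", stdev(sample))
print("h:", rule_of_thumb_bandwidth(sample))
```

Here the extreme value inflates \(S\), so the IQR-based term is the smaller one and keeps the bandwidth from becoming too wide.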

Boxplot

Violin plot

Things to look for in continuous variables

  • Asymmetry
  • Outliers
  • Multimodality
  • Gaps
  • Heaping
  • Rounding
  • Impossibilities
  • Errors
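
Two of the items above, heaping (or rounding) and gaps, can be screened numerically before plotting. The sketch below is one possible approach, not a method from the slides: it tabulates the fractional parts of the values and finds the widest gap between neighboring observations. The function names and the `heights` values are hypothetical.

```python
from collections import Counter


def fractional_part_shares(values, digits=1):
    """Share of observations at each fractional part, rounded to `digits` decimals."""
    counts = Counter(round(v % 1, digits) for v in values)
    n = len(values)
    return {frac: c / n for frac, c in sorted(counts.items())}


def largest_gap(values):
    """Return (gap width, lower value, upper value) for the widest gap between sorted neighbors."""
    ordered = sorted(values)
    return max((b - a, a, b) for a, b in zip(ordered, ordered[1:]))


# Hypothetical heights in inches, made up for illustration
heights = [64.0, 65.0, 65.5, 66.0, 66.0, 67.0, 67.2, 68.0, 68.0, 70.0]

print(fractional_part_shares(heights))  # a large share at 0.0 suggests rounding to whole inches
print(largest_gap(heights))
```

A histogram with a narrow binwidth shows the same patterns visually; the tabulation just makes the share of heaped values explicit.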

Galton’s heights

Investigate for gaps or heaping

Comparing the distributions

Outlier detection

## # A tibble: 3 x 24
##                                              title  year length budget
##                                              <chr> <int>  <int>  <int>
## 1                           Cure for Insomnia, The  1987   5220     NA
## 2                                       Four Stars  1967   1100     NA
## 3 Longest Most Meaningless Movie in the World, The  1970   2880     NA
## # ... with 20 more variables: rating <dbl>, votes <int>, r1 <dbl>,
## #   r2 <dbl>, r3 <dbl>, r4 <dbl>, r5 <dbl>, r6 <dbl>, r7 <dbl>, r8 <dbl>,
## #   r9 <dbl>, r10 <dbl>, mpaa <chr>, Action <int>, Animation <int>,
## #   Comedy <int>, Drama <int>, Documentary <int>, Romance <int>,
## #   Short <int>
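
The slides do not name a rule for flagging outliers. One common convention, assumed here, is the 1.5 × IQR fence used for boxplot whiskers. In the sketch below, the three longest run times (1100, 2880, and 5220 minutes) come from the output above; every other value is hypothetical.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]


# Run times in minutes: the last three are from the table above, the rest are made up
lengths = [80, 84, 87, 88, 90, 91, 93, 95, 96, 98,
           100, 102, 105, 108, 112, 120, 1100, 2880, 5220]

print(tukey_outliers(lengths))
```

Flagged values still need a judgment call: these three films are real, so whether to filter them depends on the question being asked.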

Filter outliers

Compare distributions of subgroups

Multiple windows plot

Boxplot

Categorical variables

  • Discrete variables with a fixed set of possible values

Bar chart

Omitted categories

Order matters

Variations on bar charts

Stacked bar chart

Dodged bar chart

Proportional bar chart

Pie chart

  • Based on angles, arc lengths, and areas
  • People misjudge/misestimate area

Pie chart experiment

A   B   C   D
10  20  40  30

Pie chart takeaways

  • For pure magnitude identification, bar charts are superior
  • For comparing percentages, either chart is acceptable
  • To compare combinations of groupings, pie charts are slightly superior
  • Tables are only useful if you want to report exact percentages

Scatterplots

  • Causal relationships (linear and nonlinear)
  • Associations (correlations)
  • Outliers or groups of outliers
  • Clusters
  • Gaps
  • Barriers
  • Conditional relationships

movies example

Smoothing lines

Adding jitter to the graph

Comparing groups within scatterplots

Scatterplot matrix

Heatmap of correlation coefficients

Parallel coordinate plots

Documenting EDA

  • Why document it
  • EDA notebook
    • Records what you did and why you did it
    • Supports rigorous thinking
    • Helps others understand your work

Documenting EDA

  • Give it a title, meaningful filename, and a first paragraph that describes the aims of the analysis
  • Don’t delete anything
  • Programmatically transform and update your data
  • Regularly compile/knit the notebook from scratch
  • Git tracking

Formats

  • Jupyter Notebooks
    • Python
    • R
    • Julia
    • Scala
    • Plus 40 more languages
  • R Markdown
    • R
    • Minimal support for other languages

Example notebooks

Python

R